
Alerting

Introduction

The PortX platform provides built-in alerting that monitors PortX-managed services running in your cluster — this includes platform components like Grafana, ArgoCD, Karavan, Istio, and the underlying infrastructure (nodes, pods, storage). If any PortX-managed service has an issue, our platform engineering team is automatically notified and will respond.

Your applications are not monitored by PortX by default. We do not create alerts for tenant-deployed workloads; that is your responsibility. However, we provide the tooling and a working template to make it straightforward. Every tenant can create custom alerts through two paths:

  • Metric-based alerts (Prometheus) — configured via your tenant GitOps repository using the tenant-alerts application and the Prometheus Operator
  • Log-based alerts (Loki) — configured through the Grafana UI

Purpose

This document provides information and step-by-step guidance for the following topics:

  • Understanding what the platform monitors out of the box
  • Creating custom metric-based alerts with PrometheusRules
  • Routing alert notifications to your team (email, Slack, OpsGenie, webhooks)
  • Creating log-based alerts through Grafana
  • Reusing existing platform alerts for your namespace

Initialisms

| Initialism | Definition |
| --- | --- |
| HPA | Horizontal Pod Autoscaler |
| OOM | Out of Memory |
| PVC | Persistent Volume Claim |
| LogQL | Loki Query Language |
| PromQL | Prometheus Query Language |


What the Platform Monitors

The following alerts are active on every tenant cluster. These are managed by PortX — no configuration is required on your part.


Application and Pod Health

| Condition | Severity | Description |
| --- | --- | --- |
| Crash Looping | Critical | A container is repeatedly crashing and restarting |
| Pod Not Ready | Warning | A pod has been unable to start for more than 15 minutes |
| Out of Memory | Warning | A container was terminated because it exceeded its memory limit |
| Image Pull Failure | Warning | A container image could not be pulled from the registry |
| Deployment Replicas Mismatch | Warning | Running pods do not match the desired count for more than 15 minutes |
| Rollout Stuck | Critical | A deployment update has stalled and is not progressing |

Infrastructure

| Condition | Severity | Description |
| --- | --- | --- |
| Node Not Ready | Critical | A cluster node is unresponsive |
| Memory / Disk Pressure | Warning | A node is running low on memory or disk |
| PVC Above 90% | Warning | A persistent volume is nearly full |
| PVC Above 95% | Critical | A persistent volume is critically full |

Platform Services

Core services including Grafana, ArgoCD, Prometheus, Loki, Tempo, Istio, and Karpenter are all monitored. If any of these become unavailable, the platform team is alerted immediately.


Autoscaling

| Condition | Severity | Description |
| --- | --- | --- |
| HPA Maxed Out | Warning | The autoscaler has been at maximum replicas for more than 15 minutes |
| HPA Not Scaling | Warning | The autoscaler cannot reach the desired replica count |


Creating Metric-Based Alerts (Prometheus)

Every tenant GitOps repository includes a tenant-alerts application under the apps/ directory. This is where you define custom Prometheus alerts using the Prometheus Operator.

The tenant-alerts app uses the prom-alert-rules Helm chart, which creates two Kubernetes resources (see the skeleton after this list):

  • PrometheusRule — defines the alert conditions using PromQL
  • AlertmanagerConfig — defines where notifications are sent and how alerts are routed
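At a high level, the values.yaml for the tenant-alerts app contains one top-level section per resource. The skeleton below is an orientation sketch assembled from the examples in the steps that follow, not a complete file:

prometheusrule:
  enabled: true
  name: my-app-alerts                   # name of the generated PrometheusRule
  labels:
    release: portx-monitoring           # required: Prometheus Operator selects rules by this label
  groups: []                            # your alert rules go here (Step 1)

alertmanager:
  enabled: true
  name: my-app-alerts                   # name of the generated AlertmanagerConfig
  labels:
    alertmanager: portx-alertmanager    # required: the platform AlertManager selects configs by this label
  receivers: []                         # notification destinations (Step 2)
  route: {}                             # grouping and routing rules (Step 2)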

Step 1. Define Your Alert Rules

Edit apps/tenant-alerts/values.yaml in your tenant GitOps repository. The prometheusrule section is where you define what conditions should trigger an alert.


Example: Alert when a deployment has zero running pods

prometheusrule:
  enabled: true
  name: my-app-alerts
  labels:
    release: portx-monitoring
    application: my-app
  groups:
    - name: deployment
      rules:
        - alert: DeploymentAt0Replicas
          expr: |
            sum(kube_deployment_status_replicas{
              pod_template_hash=""
            }) by (deployment, namespace) < 1
          for: 1m
          labels:
            app: my-app
          annotations:
            summary: "Deployment {{$labels.deployment}} has no running pods"
            description: |
              Cluster Name: {{$externalLabels.cluster}}
              Namespace: {{$labels.namespace}}
              Deployment name: {{$labels.deployment}}

Example: Alert when request error rate exceeds 5%

prometheusrule:
  enabled: true
  name: my-app-alerts
  labels:
    release: portx-monitoring
    application: my-app
  groups:
    - name: http-errors
      rules:
        - alert: HighErrorRate
          expr: |
            sum(rate(istio_requests_total{
              response_code=~"5.*",
              destination_service_namespace="prod"
            }[5m])) by (destination_service_name)
            /
            sum(rate(istio_requests_total{
              destination_service_namespace="prod"
            }[5m])) by (destination_service_name)
            > 0.05
          for: 5m
          labels:
            severity: critical
            app: my-app
          annotations:
            summary: "High 5xx error rate on {{$labels.destination_service_name}}"
            description: "Error rate is above 5% for the last 5 minutes."

Example: Alert when response latency is too high

prometheusrule:
  enabled: true
  name: my-app-alerts
  labels:
    release: portx-monitoring
    application: my-app
  groups:
    - name: latency
      rules:
        - alert: HighP99Latency
          expr: |
            histogram_quantile(0.99,
              sum(rate(istio_request_duration_milliseconds_bucket{
                destination_service_namespace="prod"
              }[5m])) by (le, destination_service_name)
            ) > 2000
          for: 10m
          labels:
            severity: warning
            app: my-app
          annotations:
            summary: "P99 latency above 2s on {{$labels.destination_service_name}}"

note

PrometheusRules must include release: portx-monitoring in their labels to be picked up by the Prometheus Operator. Without this label, your rules will be ignored.


Key Fields

| Field | Description |
| --- | --- |
| expr | The PromQL expression that defines the alert condition. When this expression returns results, the alert fires. |
| for | How long the condition must be true before the alert fires. Prevents flapping on brief spikes. |
| labels | Labels attached to the alert. Use severity: critical or severity: warning. Use app to tag your application. |
| annotations.summary | Short description shown in notifications. Supports Go template variables like {{$labels.deployment}}. |
| annotations.description | Detailed description. Include cluster, namespace, and relevant context. |
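Tying these fields together, here is a minimal annotated rule; the metric, threshold, and names are placeholders:

prometheusrule:
  enabled: true
  name: my-app-alerts
  labels:
    release: portx-monitoring
  groups:
    - name: example
      rules:
        - alert: MyAppTargetDown
          expr: up{namespace="prod"} == 0      # fires while this PromQL expression returns results
          for: 5m                              # condition must hold for 5 minutes before firing
          labels:
            severity: warning                  # severity: critical or severity: warning
            app: my-app                        # tag your application
          annotations:
            summary: "{{$labels.job}} target is down"        # short text shown in notifications
            description: "Namespace: {{$labels.namespace}}"  # detailed context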


Step 2. Configure Notification Routing

The alertmanager section in the same values.yaml file defines where alert notifications are delivered and how they are grouped.


Example: Route alerts to your team via email

alertmanager:
  enabled: true
  name: my-app-alerts
  labels:
    alertmanager: portx-alertmanager

  receivers:
    - name: 'my-team'
      emailConfigs:
        - to: your-team@example.com
          from: noreply@portx.io
          smarthost: smtp.sendgrid.net:587
          authUsername: apikey
          authPassword:              # SMTP password read from a Kubernetes Secret
            name: grafana-client-secret
            key: GF_SMTP_PASSWORD

  route:
    groupBy: [job]
    groupWait: 30s         # wait before sending the first notification for a new alert group
    groupInterval: 5m      # wait before notifying about new alerts added to an existing group
    repeatInterval: 12h    # how often to re-send a notification for a still-firing alert
    global_receiver: 'my-team'
    routes:
      - matchers:
          - matchType: =
            name: alertname
            value: DeploymentAt0Replicas
        receiver: 'my-team'

Example: Route alerts to a Slack channel

  receivers:
    - name: 'my-team-slack'
      slackConfigs:
        - apiURL:              # Slack webhook URL read from a Kubernetes Secret
            name: my-slack-secret
            key: webhook-url
          channel: '#my-app-alerts'
          sendResolved: true
          title: '{{ .Status | toUpper }}: {{ .CommonLabels.alertname }}'
          text: |
            *Namespace:* {{ .CommonLabels.namespace }}
            *Severity:* {{ .CommonLabels.severity }}
            {{ range .Alerts }}*Description:* {{ .Annotations.description }}
            {{ end }}

Example: Route alerts to OpsGenie

  receivers:
    - name: 'my-team-opsgenie'
      opsgenieConfigs:
        - sendResolved: true
          apiKey:              # OpsGenie API key read from a Kubernetes Secret
            name: my-opsgenie-secret
            key: api-key
          apiURL: "https://api.opsgenie.com"
          message: "{{ .CommonLabels.alertname }}"
          priority: '{{ if eq .CommonLabels.severity "critical" }}P1{{ else }}P3{{ end }}'
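The receiver entries follow the Prometheus Operator's AlertmanagerConfig schema, so the other receiver types it supports, such as the generic webhooks mentioned earlier, should work the same way. Assuming the chart passes these through unchanged, a webhook receiver might look like the following; the receiver name and URL are placeholders:

  receivers:
    - name: 'my-team-webhook'
      webhookConfigs:
        - url: "https://example.com/alert-hook"   # your HTTP endpoint; AlertManager POSTs a JSON payload
          sendResolved: true                      # also notify when the alert resolves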

note

AlertmanagerConfig resources must include alertmanager: portx-alertmanager in their labels. Without this label, the platform AlertManager will not recognize your routing configuration.


Routing to Existing Platform Alerts

You do not need to create new PrometheusRules to get notified about common issues. The platform already fires alerts like KubePodNotReady and PodCrashLoopBackOff. You can route these existing alerts to your own receivers by matching on the alert name and your namespace:

alertmanager:
  enabled: true
  name: my-app-alerts
  labels:
    alertmanager: portx-alertmanager

  receivers:
    - name: 'my-team'
      emailConfigs:
        - to: your-team@example.com
          from: noreply@portx.io
          smarthost: smtp.sendgrid.net:587
          authUsername: apikey
          authPassword:
            name: grafana-client-secret
            key: GF_SMTP_PASSWORD

  route:
    groupBy: [job]
    groupWait: 30s
    groupInterval: 5m
    repeatInterval: 12h
    global_receiver: 'my-team'
    routes:
      - matchers:
          - matchType: =
            name: alertname
            value: KubePodNotReady
          - matchType: =
            name: namespace
            value: prod
          - matchType: =~         # regex match on the pod name
            name: pod
            value: my-app-.*
        receiver: 'my-team'

This sends you an email whenever any pod whose name matches my-app-.* in the prod namespace is not ready, using the platform's built-in alert routed to your team.
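To see which platform alerts are currently firing for your workloads before you write routes, you can query Prometheus's built-in ALERTS metric from Grafana's Explore view (assuming the platform Prometheus is available there as a data source):

ALERTS{namespace="prod", alertstate="firing"}

Each result's alertname label is the value to match on in your routes.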



Creating Log-Based Alerts (Grafana + Loki)

For alerts based on log content (error messages, specific log patterns, log volume), you create alert rules through the Grafana UI. These alerts query Loki using LogQL and are evaluated by the Grafana alerting engine.


Step 1. Open Grafana Alerting

Navigate to your Grafana instance at:

https://tools.<your-tenant>.tenants.portx.io/grafana/alerting/list

In the left sidebar, click Alerting (bell icon), then Alert rules.


Step 2. Create a New Alert Rule

  1. Click + New alert rule
  2. Give the rule a name (e.g., "Error log spike — my-app")
  3. In the Define query and alert condition section:
    • Select the logs data source (this is your Loki instance)
    • Write a LogQL query

Example: Alert on error log volume

sum(count_over_time({namespace="prod", app="my-app"} |= "ERROR" [5m])) > 10

This fires when more than 10 error logs appear in a 5-minute window.


Example: Alert on a specific error message

count_over_time({namespace="prod", app="my-app"} |= "database connection refused" [5m]) > 0

Example: Alert on high log volume (possible log storm)

sum(rate({namespace="prod", app="my-app"}[5m])) > 100

This fires when your app is producing more than 100 log lines per second.
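If your application emits structured JSON logs, you can alert on a parsed field instead of a raw substring match. This sketch assumes each log line is JSON with a level field; adjust the field name to your log format:

sum(count_over_time({namespace="prod", app="my-app"} | json | level="error" [5m])) > 10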


Step 3. Set Evaluation Behavior

  • Evaluate every: How often the query runs (e.g., 1m)
  • For: How long the condition must be true before firing (e.g., 5m)
  • Folder and Group: Organize your alerts into folders (e.g., "My App Alerts")

Step 4. Configure Notifications

In the Notifications section:

  1. Select an existing contact point or create a new one
  2. Contact points support: email, Slack, OpsGenie, PagerDuty, webhooks, and more
  3. Add labels (e.g., severity=warning, team=my-team) for routing

Step 5. Save and Enable

Click Save rule and exit. The alert will begin evaluating immediately based on your schedule.


tip

Use Grafana's Explore view to test your LogQL queries before creating alert rules. Navigate to Explore, select the logs data source, and run your query to verify it returns the expected results.

warning

Log-based alerts are evaluated by Grafana, not by the Prometheus Operator. This means they are managed entirely through the Grafana UI and are not stored in your GitOps repository. If you need version-controlled, GitOps-managed alerting, use metric-based Prometheus alerts instead.



Summary

| Alert Type | Where to Configure | Query Language | GitOps Managed |
| --- | --- | --- | --- |
| Metric-based (Prometheus) | apps/tenant-alerts/values.yaml in your GitOps repo | PromQL | Yes |
| Log-based (Loki) | Grafana UI → Alerting → Alert rules | LogQL | No |
| Platform built-in | No configuration needed (active by default) | N/A | N/A |